Image compression with KMeans

This notebook analyzes image compression with KMeans across different values of k, initialization methods, and distance types (Manhattan and Euclidean).

Initialization methods explored are:

Compressed images can be found in the '/data/pictures/compressed/' folder, and results in the '/results/image_compression/' folder. Results included are:

Each result and compressed image is produced by running KMeans 5 times and keeping the run with the lowest loss.
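The best-of-5 protocol can be sketched as follows. This is a minimal numpy illustration with random_uniform initialization and Euclidean distance; the notebook's own implementation (function names, convergence criterion) may differ.

```python
import numpy as np

def kmeans(X, k, seed, n_iter=50):
    """Minimal Lloyd's algorithm: random_uniform init, Euclidean distance."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]  # random_uniform init
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Mean update; keep the old centroid if a cluster goes empty.
        C = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else C[j]
                      for j in range(k)])
    # Final assignment and loss (sum of squared distances to assigned centroid).
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    loss = (d[np.arange(len(X)), labels] ** 2).sum()
    return C, labels, loss

def best_of_runs(X, k, n_runs=5):
    """Repeat KMeans n_runs times and keep the run with the lowest loss."""
    return min((kmeans(X, k, seed) for seed in range(n_runs)),
               key=lambda run: run[2])
```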

1. Convergence time

We will compare the convergence time of KMeans image compression on 3 images for different values of k. The number of iterations as well as initialization and optimization times are compared (for simplicity, only the Euclidean distance with KMeans++ (https://en.wikipedia.org/wiki/K-means%2B%2B) and random_uniform initialization (where k random datapoints are selected as the initial centroids) is considered). All times are in seconds.

KMeans++ runtimes are dominated by initialization time, since seeding must loop over every datapoint and the already-picked centroids to choose each next centroid. However, its optimization time can be shorter than that of random methods. In general, the runtime of KMeans clustering increases with k, while the number of iterations can be lower for higher k due to faster convergence.
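The seeding cost described above is visible in a sketch of KMeans++ initialization: each of the k-1 sampling steps requires a full pass over the data against all centroids picked so far.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """KMeans++ seeding: each new centroid is sampled with probability
    proportional to its squared distance to the nearest centroid picked
    so far. The k-1 full-distance passes are what make initialization
    dominate the total runtime."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2) \
                 .sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)
```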

2. Compressed images

For simplicity, only pictures for the random_uniform initialization method and the l2 distance type are analyzed.

With k = 16, the compressed pictures start to look very similar to the originals.
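The compression step itself is just colour quantization: every pixel is replaced by its nearest centroid colour, so the image needs only k distinct colours. A minimal sketch (the notebook's own pipeline may be structured differently):

```python
import numpy as np

def quantize(img, centroids):
    """Replace every pixel with its nearest centroid colour (Euclidean)."""
    h, w, c = img.shape
    X = img.reshape(-1, c).astype(float)
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                            axis=2).argmin(axis=1)
    return centroids[labels].reshape(h, w, c).astype(img.dtype)
```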

3. Importance of initialization strategy

In this section we will analyze an initialization strategy deliberately devised to be poor, called inverted_plusplus. It is similar to KMeans++, but datapoints have a higher probability of being picked as the next centroid if their minimum distance to the already-picked centroids is small. This encourages the initial centroids to lie in close proximity to one another.
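One plausible way to implement this inversion is to weight each point by the reciprocal of its squared distance to the nearest picked centroid. The smoothing term below is an assumption of this sketch; the notebook's exact inverted_plusplus weighting may differ.

```python
import numpy as np

def inverted_pp_init(X, k, rng):
    """Deliberately poor seeding: sample new centroids with probability
    *inversely* related to the squared distance from the already-picked
    centroids, so new centroids tend to land near existing ones.
    The mean-based smoothing term is an assumption of this sketch."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2) \
                 .sum(axis=2).min(axis=1)
        w = 1.0 / (d2 + d2.mean() + 1e-12)  # near points get higher weight
        centroids.append(X[rng.choice(len(X), p=w / w.sum())])
    return np.asarray(centroids)
```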

First we compare this strategy to random_uniform and KMeans++ using the Euclidean distance.

The quality of compressed images is not noticeably different between the three initialization methods (the same result can be seen from the other two images). However, if we look at the runtimes for all three images:

We can see that the poor initialization method tends to have the longest total runtime: a long initialization time (owing to its similarity to KMeans++) without the benefit of faster convergence. This is also reflected in the number of iterations needed to converge, particularly at higher k.

4. Manhattan (l1) distance

4.1. Convergence times

As in Section 1, we will compare random_uniform and KMeans++ convergence times, now with the l1 distance.

The same trend still holds. However, l1 optimization seems to converge faster than l2.

4.2. Compressed images

Interestingly, there seems to be a trend that the l1 method does not pick up hues as well as the l2 method. This is particularly visible in the GeorgiaTech picture, especially at k=4 and k=8. Here we look at compressed images with random_uniform initialization; however, the trend is also present for the three other initialization methods.

This phenomenon can perhaps be explained by the intuition that the Manhattan distance penalizes deviation linearly, whereas with the Euclidean distance a large difference along one axis (with small differences along the other two axes/color components) dominates and is penalized more severely by the squaring. As a result, minimizing the l1 distance via the median is more forgiving towards datapoints that are highly similar in two colors/axes while being much less similar in the third.
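A small worked example makes the median-vs-mean contrast concrete. For l1 distance the loss-minimizing centroid update is the per-coordinate median; for l2 it is the mean. The cluster below (hypothetical RGB values) contains one point far off in the blue channel:

```python
import numpy as np

# Four RGB pixels: similar in red and green, one outlier in blue.
cluster = np.array([[200.0, 10.0, 10.0],
                    [210.0, 12.0, 14.0],
                    [205.0, 11.0, 12.0],
                    [205.0, 11.0, 250.0]])

l2_centroid = cluster.mean(axis=0)        # mean: pulled toward the blue outlier
l1_centroid = np.median(cluster, axis=0)  # median: barely moved by it

print(l2_centroid[2])  # blue = 71.5 — the outlier drags the l2 centroid
print(l1_centroid[2])  # blue = 13.0 — the l1 centroid stays with the majority
```

The median effectively ignores the single outlying colour channel, which matches the "more forgiving" behaviour described above and offers one explanation for why l1 centroids can land on less saturated hues.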